{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../../img/ods_stickers.jpg\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 红酒质量数据回归探索"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "红酒质量数据集同样来自于 UCI 数据集网站。首先，挑战导入所需模块。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.linear_model import Lasso, LassoCV, LinearRegression\n",
    "from sklearn.metrics.regression import mean_squared_error\n",
    "from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "读取并预览数据集，同时查看数据集列属性。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"../../data/winequality-white.csv\", sep=\";\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面，将数据集按 7:3 分割成训练集和测试集，设置 `random_state=17`，同时使用 `StandardScaler` 对特征数据规范化。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = data[\"quality\"]\n",
    "X = data.drop(\"quality\", axis=1)\n",
    "\n",
    "X_train, X_holdout, y_train, y_holdout = train_test_split(\n",
    "    X, y, test_size=0.3, random_state=17\n",
    ")\n",
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_holdout_scaled = scaler.transform(X_holdout)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 线性回归"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用普通最小二乘法线性回归训练一个回归预测模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "linreg = LinearRegression()\n",
    "linreg.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i>训练数据和测试数据上的平均绝对误差 MSE 值是多少？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，你需要将回归系数按绝对值从大到小进行排序。值得注意的是，特征所对应的拟合系数绝对值越大，即代表该特征对目标值的影响越大。这里建议你使用 `pandas.DataFrame` 对数据进行展示。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i>该线性回归模型中，哪些特征对目标值的影响更大？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lasso 回归"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用规范化后的数据训练一下 Lasso 回归模型，设置 $\\alpha = 0.01$，且 `random_state=17`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso1 = Lasso(alpha=0.01, random_state=17)\n",
    "lasso1.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i> 该 Lasso 回归模型中，哪些特征对目标值的影响最小？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，使用 LassoCV 来寻找最合适的 $\\alpha$ 参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alphas = np.logspace(-6, 2, 200)\n",
    "lasso_cv = LassoCV(random_state=17, cv=5, alphas=alphas)\n",
    "lasso_cv.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso_cv.alpha_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i> 上面调参得到的 Lasso 回归模型中，哪些特征对目标值的影响最小？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i> 调参得到的 Lasso 回归模型中，训练数据和测试数据上的平均绝对误差 MSE 值是多少？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 随机森林"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用默认参数训练一个随机森林回归模型，设置 `random_state=17`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "forest = RandomForestRegressor(random_state=17)\n",
    "forest.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "和上面的过程相似，接下来请自行对随机森林回归模型进行调餐，并探索不同特征对目标值的影响程度，最终得到 MSE 值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Mean squared error (train): %.3f\"\n",
    "    % mean_squared_error(y_train, forest.predict(X_train_scaled))\n",
    ")\n",
    "print(\n",
    "    \"Mean squared error (cv): %.3f\"\n",
    "    % np.mean(\n",
    "        np.abs(\n",
    "            cross_val_score(\n",
    "                forest, X_train_scaled, y_train, scoring=\"neg_mean_squared_error\"\n",
    "            )\n",
    "        )\n",
    "    )\n",
    ")\n",
    "print(\n",
    "    \"Mean squared error (test): %.3f\"\n",
    "    % mean_squared_error(y_holdout, forest.predict(X_holdout_scaled))\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "forest_params = {\"max_depth\": list(range(18, 20)), \"max_features\": list(range(6, 8))}\n",
    "\n",
    "locally_best_forest = GridSearchCV(\n",
    "    RandomForestRegressor(n_jobs=-1, random_state=17),\n",
    "    forest_params,\n",
    "    scoring=\"neg_mean_squared_error\",\n",
    "    n_jobs=-1,\n",
    "    cv=5,\n",
    "    verbose=True,\n",
    ")\n",
    "locally_best_forest.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "locally_best_forest.best_params_, locally_best_forest.best_score_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Mean squared error (cv): %.3f\"\n",
    "    % np.mean(\n",
    "        np.abs(\n",
    "            cross_val_score(\n",
    "                locally_best_forest.best_estimator_,\n",
    "                X_train_scaled,\n",
    "                y_train,\n",
    "                scoring=\"neg_mean_squared_error\",\n",
    "            )\n",
    "        )\n",
    "    )\n",
    ")\n",
    "print(\n",
    "    \"Mean squared error (test): %.3f\"\n",
    "    % mean_squared_error(y_holdout, locally_best_forest.predict(X_holdout_scaled))\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf_importance = pd.DataFrame(\n",
    "    locally_best_forest.best_estimator_.feature_importances_,\n",
    "    columns=[\"coef\"],\n",
    "    index=data.columns[:-1],\n",
    ")\n",
    "rf_importance.sort_values(by=\"coef\", ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最终，你应该可以发现葡萄酒质量数据集中特征和目标之间呈现较强的非线性关系，所以随机森林在这项任务中工作得更好。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color: #e6e6e6; margin-bottom: 10px; padding: 1%; border: 1px solid #ccc; border-radius: 6px;text-align: center;\"><a href=\"https://nbviewer.jupyter.org/github/shiyanlou/mlcourse-answers/tree/master/\" title=\"挑战参考答案\"><i class=\"fa fa-file-code-o\" aria-hidden=\"true\"> 查看挑战参考答案</i></a></div>"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "name": "lesson8_part1_kmeans.ipynb"
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
